Abstract Exploiting the performance of today's microprocessors requires
intimate knowledge of the microarchitecture as well as an awareness of the
ever-growing complexity in thread and cache topology. LIKWID is a set of
command line utilities that addresses four key problems: probing the thread
and cache topology of a shared-memory node, enforcing thread-core affinity
on a program, measuring performance counter metrics, and microbenchmarking
for reliable upper performance bounds. Moreover, it includes an
mpirun wrapper allowing for portable thread-core affinity in MPI and hybrid
MPI/threaded applications. To demonstrate the capabilities of the tool set
we show the influence of thread affinity on performance using the well-known
OpenMP STREAM triad benchmark, use hardware counter tools to study
the performance of a stencil code, and finally show how to detect bandwidth
problems on ccNUMA-based compute nodes.



1 Introduction 



Today's multicore x86 processors bear multiple complexities when aiming for
high performance. Conventional performance tuning tools like Intel VTune,
OProfile, CodeAnalyst, OpenSpeedShop, etc., require a lot of experience in
order to get sensible results. For this reason they are usually unsuitable for
scientific users, who would often be satisfied with a rough overview of the
performance properties of their application code. Moreover, advanced tools
often require kernel patches and additional software components, which make
them unwieldy and bug-prone. Additional confusion arises with the complex
multicore, multicache, multisocket structure of modern systems (see Fig. 1);
users are all too often at a loss about how hardware thread IDs are assigned
to resources like cores, caches, sockets and NUMA domains. Moreover, the
technical details of how threads and processes are bound to those resources
vary strongly across compilers and MPI libraries.

LIKWID ("Like I Knew What I'm Doing") is a set of easy-to-use command
line tools to support optimization. It is targeted towards performance-oriented
programming in a Linux environment, does not require any kernel
patching, and is suitable for Intel and AMD processor architectures.
Multithreaded and even hybrid shared/distributed-memory parallel code is
supported. LIKWID comprises the following tools:

• likwid-features can display and alter the state of the on-chip hardware
prefetching units in Intel x86 processors.

• likwid-topology probes the hardware thread and cache topology in
multicore, multisocket nodes. Knowledge like this is required to optimize
the use of resources such as shared caches and data paths, physical cores,
and ccNUMA locality domains in parallel code.

• likwid-perfctr measures performance counter metrics over the complete
runtime of an application or, with support from a simple API, between
arbitrary points in the code. Although it is possible to specify the full,
hardware-dependent event names, some predefined event sets simplify
matters when standard information like memory bandwidth or Flop counts is
needed.

• likwid-pin enforces thread-core affinity in a multi-threaded application
"from the outside," i.e., without changing the source code. It works with
all threading models that are based on POSIX threads, and is also
compatible with hybrid "MPI+threads" programming. Sensible use of likwid-pin
requires correct information about thread numbering and cache topology,
which can be delivered by likwid-topology (see above).

• likwid-mpirun pins a pure MPI or hybrid MPI/threaded application
to dedicated compute resources in an intuitive and portable way.

• likwid-bench is a microbenchmarking framework allowing rapid
prototyping of small assembly kernels. It supports threading, thread and
memory placement, and performance measurement. likwid-bench comes with
a wide range of typical benchmark cases and can be used as a stand-alone
benchmarking application.

Although the six tools may appear to be partly unrelated, they solve the
typical problems application programmers encounter when porting and running
their code on complex multicore/multisocket environments. Hence, we
consider it a natural idea to provide them as a single tool set.

This paper is organized as follows. Section 2 describes two of the tools in 
some detail and gives hints for typical use. Section 3 demonstrates the use of 
LIKWID in three different case studies, and Section 4 gives a summary and 
an outlook to future work. 

Fig. 1: Cache and thread topology of Intel Core 2 Quad and Nehalem EP
Westmere processors.



2 Tools 



LIKWID only supports x86-based processors. Given the strong prevalence of 
those architectures in the HPC market (e.g., 90% of all systems in the latest 
Top 500 list are of x86 type) we do not consider this a severe limitation. In 
other areas like, e.g., workstations or desktops, the x86 dominance is even 
larger. 

An important concept shared by all tools in the set is logical numbering 
of compute resources inside so-called thread domains. Under the Linux OS, 
hardware threads in a compute node are numbered according to some scheme 
that heavily depends on the BIOS and kernel version, and which may be 
unrelated to natural topological units like cache groups, sockets, etc. Since 
users naturally think in terms of topological structures, LIKWID introduces a 
simple and yet flexible syntax for specifying processor resources. This syntax 
consists of a prefix character and a list of logical IDs, which can also include 
ranges. The following domains are supported: 

Node                      N
Socket                    S[0-9]
Last level shared cache   C[0-9]
NUMA domain               M[0-9]

Multiple ID lists can be combined, allowing a flexible numbering of compute
resources. To indicate, e.g., the first two cores of the first and third NUMA
domains (M0 and M2), the following string can be used: M0:0,1@M2:0,1.
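
To make the thread domain syntax concrete, the following C sketch expands an expression such as M0:0,1@M2:0,1 into (domain, domain ID, logical ID) entries. It is an illustration only and not LIKWID's actual parser: the names DomainEntry and parse_domain_expr are hypothetical, and the grammar (prefix character, optional domain ID, colon, comma-separated IDs or ranges, domains joined by @) is inferred from the examples in this paper.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* One expanded entry: a logical ID inside a thread domain such as "M2".
 * Hypothetical type, used only for this illustration. */
typedef struct {
    char domain;     /* prefix character: N, S, C, or M */
    int  domain_id;  /* e.g. 2 for "M2"; 0 for the node domain "N" */
    int  logical_id; /* logical ID within that domain */
} DomainEntry;

/* Expand e.g. "M0:0,1@M2:0,1" or "S0:0-3" into entries.
 * Returns the number of entries written, or -1 on a syntax error
 * or if more than max_out entries would be needed. */
int parse_domain_expr(const char *expr, DomainEntry *out, int max_out)
{
    assert(expr != NULL && out != NULL);
    int n = 0;
    const char *p = expr;

    while (*p != '\0') {
        /* Domain prefix and optional numeric domain ID. */
        char domain = *p++;
        if (strchr("NSCM", domain) == NULL) return -1;
        int domain_id = 0;
        char *end;
        if (isdigit((unsigned char)*p)) {
            domain_id = (int)strtol(p, &end, 10);
            p = end;
        }
        if (*p++ != ':') return -1;

        /* Comma-separated list of single IDs or ranges "first-last". */
        for (;;) {
            if (!isdigit((unsigned char)*p)) return -1;
            long first = strtol(p, &end, 10);
            long last = first;
            p = end;
            if (*p == '-') {
                p++;
                if (!isdigit((unsigned char)*p)) return -1;
                last = strtol(p, &end, 10);
                p = end;
                if (last < first) return -1;
            }
            for (long id = first; id <= last; id++) {
                if (n >= max_out) return -1;
                out[n].domain = domain;
                out[n].domain_id = domain_id;
                out[n].logical_id = (int)id;
                n++;
            }
            if (*p != ',') break;
            p++;
        }

        /* Either another domain follows after '@', or the string ends. */
        if (*p == '@') {
            p++;
            if (*p == '\0') return -1;
        } else if (*p != '\0') {
            return -1;
        }
    }
    return n;
}
```

For the expression above the function yields four entries: logical IDs 0 and 1 in M0, followed by logical IDs 0 and 1 in M2. Mapping these logical IDs to the hardware thread numbers used by the OS requires the topology information and is not shown here.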

In the following we describe two of the six tools in more detail. Thorough
documentation of all tools, beyond the man pages, can be found on the wiki
pages of the LIKWID homepage [1].

2.1 likwid-perfctr

Hardware-specific optimization requires an intimate knowledge of the
microarchitecture of a processor and the characteristics of the code. While many
problems can be solved with profiling, common sense, and runtime
measurements, additional information is often useful to get a complete picture.

Performance counters are facilities to count hardware events during code
execution on a processor. Since this mechanism is implemented directly in
hardware there is no overhead involved. All modern processors provide
hardware performance counters. They are attractive for application programmers,
because they allow an in-depth view of what happens on the processor while
running applications. As shown below, likwid-perfctr has practically zero
overhead since it reads performance metrics at predefined points. It does not
support statistical counter sampling. At the time of writing, likwid-perfctr
runs on all current x86-based architectures.

Probably the best-known and most widespread existing tool is the PAPI library
[6, 5]. A lot of research is targeted towards using hardware counter data for
automatic analysis and detecting potential performance bottlenecks [2, 3, 4].
However, those solutions are often too unwieldy for the common user, who
would prefer a quick overview as a first step in performance analysis. Key
design goals for likwid-perfctr were ease of installation and use, minimal
system requirements (no additional kernel modules and patches), and, at least
for basic functionality, no changes to the user code. A prototype for the
development of likwid-perfctr is the SGI tool "perfex," which was available
on MIPS-based IRIX machines as part of the "SpeedShop" performance
suite. Cray provides a similar, PAPI-based tool (craypat) on their systems [8].

likwid-perfctr is a dedicated command line tool for programmers, allowing
quick and flexible measurement of hardware performance counters on
x86 processors, and is available as open source. It allows simultaneous
measurements on multiple cores. Events that are shared among the cores of a
socket (this pertains to the "uncore" events on Core i7-type processors) are
supported via "socket locks," which enforce that all uncore event counts are
assigned to one thread per socket. Events are specified on the command line,
and the number of events to count concurrently is limited by the number
of performance counters on the CPU. These features are available without
any changes in the user's source code. A small instrumentation ("marker")
API allows one to restrict measurements to certain parts of the code (named
regions) with automatic accumulation over all regions of the same name. An
important difference to most existing performance tools is that event counts
are strictly core-based instead of process-based: Everything that runs and
generates events on a core is taken into account; no attempt is made to filter
events according to the process that caused them. The user is responsible for
enforcing appropriate affinity to get sensible results. This can be achieved
with likwid-perfctr itself or alternatively via likwid-pin (see below for
more information):

$ likwid-perfctr -C S0:0 \
-g SIMD_COMP_INST_RETIRED_PACKED_DOUBLE:PMC0,\
SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE:PMC1 ./a.out

(See below for typical output in a more elaborate setting.) In this example,
the computational double precision packed and scalar SSE retired instruction
counts on an Intel Core 2 processor are assigned to performance counters
0 and 1 and measured on the first core (ID 0) of the first socket (domain S0)
over the duration of a.out's runtime. As a side effect, it becomes possible
to use likwid-perfctr as a monitoring tool for a complete shared-memory
node, just by specifying all cores for measurement and, e.g., "sleep" as the
application:

$ likwid-perfctr -c N:0-7 \
-g SIMD_COMP_INST_RETIRED_PACKED_DOUBLE:PMC0,\
SIMD_COMP_INST_RETIRED_SCALAR_DOUBLE:PMC1 \
sleep 1

Apart from naming events as they are documented in the vendor's manuals,
it is also possible to use preconfigured event sets (groups) with derived
metrics. This provides a simple abstraction layer in cases where standard
information like memory bandwidth, Flops per second, etc., is sufficient:

$ likwid-perfctr -C N:0-3 -g FLOPS_DP ./a.out

The event groups are partly inspired by a technical report published by
AMD [7], and all supported groups can be obtained by using the -a command
line switch. We try to provide the same preconfigured event groups
on all supported architectures, as long as the native events support them.
This allows the beginner to concentrate on the useful information right away,
without the need to look up events in the manuals (similar to PAPI's
high-level events). In the usage scenarios described so far there is no interference
of likwid-perfctr while user code is being executed, i.e., the overhead is
very small (apart from the unavoidable API call overhead in marker mode).

The following example illustrates the use of the marker API in a serial 
program with two named regions ("Main" and "Accum"): 

#include <likwid.h>

int coreID = likwid_processGetProcessorId();
likwid_markerInit(numberOfThreads, numberOfRegions);
int MainId = likwid_markerRegisterRegion("Main");
int AccumId = likwid_markerRegisterRegion("Accum");

likwid_markerStartRegion(0, coreID);
// measured code region "Main" here
likwid_markerStopRegion(0, coreID, MainId);

for (j = 0; j < N; j++) {
    likwid_markerStartRegion(0, coreID);
    // measured code region "Accum" here
    likwid_markerStopRegion(0, coreID, AccumId);
}

likwid_markerClose();

Event counts are automatically accumulated on multiple calls. Nesting or
partial overlap of code regions is not allowed. The API requires specification
of a thread ID (0 for one process only in the example) and the core ID of the
thread/process. The LIKWID API provides simple functions to determine
the core ID of processes or threads. The following listing shows the shortened
output of likwid-perfctr after measurement of the FLOPS_DP event group
on four cores of an Intel Core 2 Quad processor in marker mode with two
named regions ("Init" and "Benchmark," respectively):

$ likwid-perfctr -c 0-3 -g FLOPS_DP -m ./a.out

CPU type:  Intel Core 2 45nm processor
CPU clock: 2.83 GHz
Measuring group FLOPS_DP

Region: Init

| Event                 | core 0      | core 1      | core 2      | core 3      |
| INSTR_RETIRED_ANY     | 313742      | 376154      | 355430      | 341988      |
| CPU_CLK_UNHALTED_CORE | 217578      | 504187      | 477785      | 459276      |

| Metric                | core 0      | core 1      | core 2      | core 3      |
| Runtime [s]           | 7.67906e-05 | 0.000177945 | 0.000168626 | 0.000162094 |
| CPI                   | 0.693493    | 1.34037     | 1.34424     | 1.34296     |
| DP MFlops/s           | 0.0130224   | 0.00561973  | 0.00593027  | 0.00616926  |

Region: Benchmark

| Event                 | core 0      | core 1      | core 2      | core 3      |
| INSTR_RETIRED_ANY     | 1.88024e+07 | 1.85461e+07 | 1.84947e+07 | 1.84766e+07 |

| Metric                | core 0      | core 1      | core 2      | core 3      |
| CPI                   | 1.52023     | 1.52252     | 1.52708     | 1.52661     |
| DP MFlops/s           | 1624.08     | 1644.03     | 1643.68     | 1645.8      |



Note that the INSTR_RETIRED_ANY and CPU_CLK_UNHALTED_CORE events are
always counted (using two nonassignable "fixed counters" on the Core 2
architecture), so that the derived CPI metric ("cycles per instruction") is easily
obtained.
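
As a worked illustration of how such derived metrics follow from the raw counts, the C sketch below computes CPI and a runtime estimate from the two fixed-counter events. The function names are hypothetical, and the runtime formula (unhalted core cycles divided by the nominal clock) is an assumption based on the statement that the runtime is derived from CPU_CLK_UNHALTED_CORE; it is not taken from the LIKWID sources.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per instruction from the two always-counted events. */
double derived_cpi(uint64_t clk_unhalted_core, uint64_t instr_retired_any)
{
    assert(instr_retired_any > 0);
    return (double)clk_unhalted_core / (double)instr_retired_any;
}

/* Runtime estimate in seconds: unhalted core cycles divided by the
 * clock frequency in Hz (assumed formula, see the text above). */
double derived_runtime(uint64_t clk_unhalted_core, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)clk_unhalted_core / clock_hz;
}
```

With the core 0 counts of the "Init" region, derived_cpi(217578, 313742) gives about 0.6935, which agrees with the CPI printed in the listing. derived_runtime(217578, 2.83e9) gives about 7.69e-05 s, close to the reported 7.67906e-05 s; the small difference is consistent with the clock frequency being printed in rounded form.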



2.2 likwid-pin 

Thread/process affinity is vital for performance. If topology information is
available, it is possible to "pin" threads according to the application's
resource requirements like bandwidth, cache sizes, etc. Correct pinning is even
more important on processors supporting SMT, where multiple hardware
threads share resources on a single core. likwid-pin supports thread affinity
for all threading models that are based on POSIX threads, which includes
most OpenMP implementations. By overloading the pthread_create API
call with a shared library wrapper, each thread can be pinned in turn upon
creation, working through a list of core IDs. This list, and possibly other
parameters, are encoded in environment variables that are evaluated when the
library wrapper is first called. likwid-pin simply starts the user application
with the library preloaded.

Fig. 2: Basic architecture of likwid-pin.

This architecture is illustrated in Fig. 2. No code changes are required, but
the application must be dynamically linked. This mechanism is independent
of the processor architecture, but the way the compiled code creates
application threads must be taken into account: For instance, the Intel OpenMP
implementation always runs OMP_NUM_THREADS threads but uses the first newly
created thread as a management thread, which should not be pinned. This
knowledge must be communicated to the wrapper library. The following
example shows how to use likwid-pin with an OpenMP application compiled
with the Intel compiler:

$ export OMP_NUM_THREADS=4
$ likwid-pin -c N:0-3 -t intel ./a.out

In general, likwid-pin can be used as a replacement for the taskset
tool, which cannot pin threads individually. Currently, POSIX threads, Intel
OpenMP, and GNU (gcc) OpenMP are supported directly, and the latter is
assumed as the default if the -t option is not used. A bit mask can be
specified to identify the management threads for cases not covered by the available
parameters to the -t option. Moreover, likwid-pin can also be employed for
hybrid programs that combine MPI with some threading model, if the MPI
process startup mechanism establishes a Linux cpuset for every process.
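
The interplay between the core list and the management-thread bit mask can be sketched in C as follows. This is a simplified model for illustration, not the wrapper library's code: the function name plan_pinning is hypothetical, and it is an assumption of this sketch that the main thread already occupies the first core of the list, that bit i of the mask refers to the i-th thread created via pthread_create, and that the core list wraps around when exhausted.

```c
#include <assert.h>
#include <stddef.h>

/* Decide the core for each thread created via pthread_create.
 * created[i] receives the core ID for the i-th created thread, or -1 if
 * bit i of skip_mask marks it as a management thread that stays unpinned.
 * Assumption: the main thread already occupies cores[0], so created
 * threads start at cores[1] and wrap around at the end of the list. */
void plan_pinning(const int *cores, size_t ncores, unsigned long skip_mask,
                  int *created, size_t ncreated)
{
    assert(cores != NULL && created != NULL && ncores > 0);
    size_t next = 1 % ncores;

    for (size_t i = 0; i < ncreated; i++) {
        /* Threads beyond the width of the mask are never skipped. */
        int skipped = i < 8 * sizeof skip_mask && ((skip_mask >> i) & 1UL);
        if (skipped) {
            created[i] = -1;
            continue;
        }
        created[i] = cores[next];
        next = (next + 1) % ncores;
    }
}
```

For the Intel OpenMP case described above with cores 0-3, four created threads and mask 0x1, the plan is -1, 1, 2, 3: the management thread is left unpinned and the three worker threads follow the main thread on the remaining cores. In a real wrapper each non-negative entry would be passed to an affinity call at thread creation time.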

The big advantage of likwid-pin is its portable approach to the pinning
problem, since the same tool can be used for most applications, compilers,
MPI implementations, and processor types. In Section 3.1 the usage model
is analyzed in more detail using the example of the STREAM triad.

Fig. 3: STREAM triad test run with the Intel C compiler on a dual-socket
Intel Westmere system (six physical cores per socket). In Fig. (a) threads 
are not pinned and the Intel pinning mechanism is disabled. In Fig. (b) the 
application is pinned such that threads are equally distributed on the sockets 
to utilize the memory bandwidth in the most effective way. Moreover, the 
threads are first distributed over physical cores and then over SMT threads. 



3 Case studies 



3.1 Case study 1: Influence of thread topology on 
STREAM triad performance 

To illustrate the general importance of thread affinity we use the well-known
OpenMP STREAM triad on an Intel Westmere dual-socket system. Intel
Westmere is a hexacore design based on the Nehalem architecture and
supports two SMT threads per physical core. The Intel C compiler version 11.1
was used with options -openmp -O3 -xSSE4.2 -fno-fnalias. Intel compilers
support thread affinity only if the application is executed on Intel
processors. The functionality of this topology interface is controlled by setting
the environment variable KMP_AFFINITY. In our tests KMP_AFFINITY was set
to disabled. For the case of the STREAM triad on these ccNUMA
architectures the best performance is achieved if threads are equally distributed
across the two sockets.

Figure 3 shows the results. The non-pinned case shows a large variance in
performance, especially for the smaller thread counts, where the probability is
large that only one socket is used. With larger thread counts there is a high
probability that both sockets are used; still, there is also a chance that cores
are oversubscribed, which reduces performance. The pinned case consistently
shows high performance throughout. It is apparent that the SMT threads
of Westmere increase the chance of different threads fighting for common
resources.

Fig. 4: Time-resolved results for the iteration phase of an MPI-parallel Lattice
Boltzmann solver on one socket (four cores) of an Intel Nehalem compute 
node. The compute performance in MFlops/s (Fig. (a)) and the memory 
bandwidth in MBytes/s (Fig. (b)) are shown over a duration of 10 seconds, 
comparing two versions of the computational kernel (standard C versus SIMD 
intrinsics). 



3.2 Case Study 2: Monitoring the performance of a 
Lattice Boltzmann fluid solver 

To demonstrate the daemon mode option of likwid-perfctr, an MPI-parallel
Lattice Boltzmann fluid solver was analyzed on an Intel Nehalem quad-core
system (Fig. 4). The daemon mode of likwid-perfctr allows time-resolved
measurements of counter values and derived metrics in performance groups.
It is used as follows:

$ likwid-perfctr -c S0:0-3 -g FLOPS_DP -d 800ms

This command measures the performance group FLOPS_DP on all physical
cores of the first socket, with an interval of 800 ms between samples.
likwid-perfctr only reads out the hardware monitoring counters and
prints the difference between the current and the previous measurement.
Therefore, the overhead is kept to a minimum. For this analysis the
performance groups FLOPS_DP and MEM were used.
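
The difference-based readout can be turned into a rate with a single division, as the following C sketch shows. The function name counter_rate is hypothetical, and the sketch assumes a monotonically increasing 64-bit software view of the counter; handling of hardware counter overflow is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Event rate in events per second between two successive readings of a
 * counter taken interval_s seconds apart. Counter overflow between the
 * two readings is not handled in this simplified sketch. */
double counter_rate(uint64_t previous, uint64_t current, double interval_s)
{
    assert(current >= previous && interval_s > 0.0);
    return (double)(current - previous) / interval_s;
}
```

For example, if a floating point event count grows by 1.6 million between two samples taken 0.8 s apart, the rate is 2 million events per second; dividing by 1e6 expresses it in units of millions per second, in the style of the MFlops/s metric.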



3.3 Case Study 3: Detecting ccNUMA problems on 
modern compute nodes 

Many performance problems in shared memory codes are caused by an
inefficient use of the ccNUMA memory organization on modern compute nodes.
CcNUMA technology achieves scalable memory size and bandwidth at the
price of higher programming complexity: The well-known locality and
contention problems can have a large impact on the performance of
multithreaded memory-bound programs if parallel first touch placement is not
used on initialization loops, or is not possible for some reason [9].

Fig. 5: NUMA problems reproduced with likwid-bench on the example of
a memory copy benchmark on an Intel Nehalem dual-socket quad-core
machine. The bandwidth on the top is the effective total application bandwidth
as measured by likwid-bench itself. The other bandwidth values are from
likwid-perfctr measurements in wrapper mode.

likwid-perfctr supports the developer in detecting NUMA problems
with two performance groups: MEM and NUMA. While on some architectures
like, e.g., newer Intel systems, all events can be measured in one run using
the MEM group, a separate group (NUMA) is necessary on AMD processors.

The example in Fig. 5 shows results for a memory copy benchmark,
which is part of likwid-bench. Since likwid-bench allows easy control of
thread and data placement it is well suited to demonstrate the capabilities of
likwid-perfctr in detecting NUMA problems. Here, likwid-perfctr was
used as follows:

$ likwid-perfctr -c S0:0@S1:0 -g MEM ./a.out

The relevant output for the derived metrics could look like this: 

| Metric                      | core 0    | core 4   |
| Runtime [s]                 | 4.71567   | 0.138517 |
| CPI                         | 16.4815   | 0.605114 |
| Memory bandwidth [MBytes/s] | 6.9273    | 6998.71  |
| Remote Read BW [MBytes/s]   | 0.454106  | 4589.46  |
| Remote Write BW [MBytes/s]  | 0.0705132 | 2289.32  |
| Remote BW [MBytes/s]        | 0.524619  | 6878.78  |



All threads were executed on socket zero, as can be seen from the runtime,
which is based on the CPU_CLK_UNHALTED_CORE metric. All program data
originated from socket one since there is practically no local memory
bandwidth. Hence, all bandwidth on socket one came from the remote socket.
Fig. 5 (a) shows the results for sequential data initialization on one socket;
the overall bandwidth is 9.83 GB/s. Fig. 5 (b) shows the case with correct first
touch data placement on both sockets. The effective bandwidth is 23.15 GB/s,
and the scalable ccNUMA system is used in the most efficient way. If an
application cannot be easily changed to make use of the first touch memory
policy, a viable compromise is often to switch to automatic round-robin page
placement across a set of NUMA domains, or interleaving. likwid-pin can
enforce interleaving for all NUMA domains included in a threaded run. This
can be achieved with the -i option:

$ likwid-pin -c S0:0-3@S1:0-3 -t intel -i ./a.out

Figure 5 (c) reveals that the memory bandwidth achieved with interleaving 
policy, while not as good as with correct first touch, is still much larger than 
the bandwidth of case (a) with all data in one NUMA domain. 
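
The parallel first touch placement referred to in this case study can be sketched in C with OpenMP as follows. This is a minimal illustration under stated assumptions rather than the benchmark code used for Fig. 5: the function name triad_first_touch is hypothetical, the kernel is the STREAM triad a[i] = b[i] + s * c[i], and the sketch relies on the operating system mapping a page into the NUMA domain of the thread that first writes to it, with threads pinned (e.g., via likwid-pin) so that they do not migrate between the two loops.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate three arrays, initialize them with parallel first touch, and
 * run the STREAM triad with the same static loop schedule, so that each
 * thread later works on the pages it touched first.
 * Returns a[n-1] as a simple check value, or -1.0 if allocation fails. */
double triad_first_touch(size_t n, double s)
{
    assert(n > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (a == NULL || b == NULL || c == NULL) {
        free(a);
        free(b);
        free(c);
        return -1.0;
    }

    /* First touch: each thread writes its own chunk of every array. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        a[i] = 0.0;
        b[i] = 1.0;
        c[i] = 2.0;
    }

    /* Compute loop: identical schedule, hence identical chunk ownership. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        a[i] = b[i] + s * c[i];
    }

    double check = a[n - 1];
    free(a);
    free(b);
    free(c);
    return check;
}
```

If the initialization loop were executed by a single thread instead, all pages would end up in one NUMA domain, which corresponds to case (a). Compiled without OpenMP support the pragmas are ignored and the code runs serially with the same result.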



4 Conclusion and future plans 

LIKWID is a collection of command line applications supporting
performance-oriented software developers in their effort to utilize today's multicore
processors in an effective manner. LIKWID does not try to follow the trend to
provide yet another complex and sophisticated tooling environment, which
would be difficult to set up and would overwhelm the average user with large
amounts of data. Instead it tries to make the important functionality
accessible with as few obstacles as possible. The focus is put on simplicity and
low overhead. likwid-topology and likwid-pin enable the user to account
for the influence of thread and cache topology on performance and pin their
application to physical resources in all possible scenarios with one single tool
and no code changes. The usage of likwid-perfctr was demonstrated on
two examples. LIKWID is open source and released under GPL2. It can be
downloaded at http://code.google.com/p/likwid/.

Future plans include applying the philosophy of LIKWID to other areas 
like, e.g., profiling (also on the assembly level). Emphasis will also be put 
on a further improvement with regard to usability. It is also planned to port 
parts of LIKWID to the Windows operating system. An ongoing effort is to 
add support for present and upcoming architectures like, e.g., the Intel Sandy 
Bridge microarchitecture. 